Classifier Risk Estimation under Limited Labeling Resources
Authors
Abstract
In this paper we propose strategies for estimating the performance of a classifier when labels cannot be obtained for the whole test set and the number of test instances that can be labeled is very small compared to the size of the test data. The goal is to obtain a precise estimate of classifier performance using as little labeling resource as possible. Specifically, we ask: how should a subset of the large test set be selected for labeling so that the classifier's performance estimated on this subset is as close as possible to its performance on the whole test set? We propose strategies based on stratified sampling for selecting this subset. We show that these strategies can reduce the variance in the estimate of classifier accuracy by a significant amount compared to simple random sampling (over 65% in several cases). Hence, our proposed methods are much more precise than random sampling for accuracy estimation under restricted labeling resources. The reduction in the number of samples required (compared to random sampling) to estimate classifier accuracy within 1% error is as high as 60% in some cases.
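The idea in the abstract can be illustrated with a minimal simulation. This is a sketch, not the authors' exact method: the data is synthetic, and stratifying on the classifier's confidence score (with accuracy assumed to differ across confidence strata) is an assumption made here for illustration. Under proportional allocation, the stratified estimate of accuracy has lower variance than simple random sampling whenever the strata differ in their mean correctness:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a labeled test set (assumption: the paper's data and
# stratification variables are not reproduced here). Accuracy is made to
# correlate sharply with the classifier's confidence score.
n = 100_000
conf = rng.uniform(0.0, 1.0, n)                # classifier confidence per instance
p_correct = np.where(conf > 0.5, 0.9, 0.1)     # correctness probability per instance
correct = (rng.uniform(0.0, 1.0, n) < p_correct).astype(float)
true_acc = correct.mean()

budget = 200  # labeling budget: how many test instances we can afford to label

def srs_estimate():
    """Simple random sampling: label `budget` random instances."""
    idx = rng.choice(n, budget, replace=False)
    return correct[idx].mean()

def stratified_estimate(k=5):
    """Proportional-allocation stratified sampling over k confidence bins."""
    strata = np.minimum((conf * k).astype(int), k - 1)
    est = 0.0
    for s in range(k):
        members = np.flatnonzero(strata == s)
        m = max(1, round(budget * len(members) / n))   # proportional share of budget
        idx = rng.choice(members, min(m, len(members)), replace=False)
        est += (len(members) / n) * correct[idx].mean()  # weight by stratum size
    return est

# Repeat each scheme many times to compare the spread of the estimates.
srs = np.array([srs_estimate() for _ in range(500)])
strat = np.array([stratified_estimate() for _ in range(500)])
print(f"true accuracy        : {true_acc:.4f}")
print(f"SRS        mean/std  : {srs.mean():.4f} / {srs.std():.4f}")
print(f"stratified mean/std  : {strat.mean():.4f} / {strat.std():.4f}")
```

Both estimators are (approximately) unbiased, but the stratified one concentrates much more tightly around the true accuracy: proportional allocation removes the between-strata component of the variance, which is large here because correctness depends strongly on the stratification variable.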
Similar papers
Automatic Accent Annotation with Limited Manually Labeled Data
Manually annotating the accent labels of a large speech corpus is both tedious and time-consuming. In this paper we investigate an automatic accent labeling procedure using classifiers trained from limited manually labeled data. Different methods are proposed and compared in a multi-classifier framework, including a linguistic classifier, an acoustic classifier, and a combined one. The ling...
Target contrastive pessimistic risk for robust domain adaptation
In domain adaptation, classifiers with information from a source domain adapt to generalize to a target domain. However, an adaptive classifier can perform worse than a non-adaptive classifier due to invalid assumptions, increased sensitivity to estimation errors or model misspecification. Our goal is to develop a domain-adaptive classifier that is robust in the sense that it does not rely on r...
Discriminative Similarity for Clustering and Semi-Supervised Learning
Similarity-based clustering and semi-supervised learning methods separate the data into clusters or classes according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose a novel discriminative similarity learning framework which learns discriminative similarity for either data clustering or semi-supervised learning...
Discharge Estimation by using Tsallis Entropy Concept
Flow-rate measurement in rivers under different conditions is required for river management purposes including water resources planning, pollution prevention, and flood control. This study proposed a new discharge estimation method by using a mean velocity derived from a 2D velocity distribution formula based on Tsallis entropy concept. This procedure is done based on several factors which refl...
On Integral Probability Metrics, φ-Divergences and Binary Classification
φ-divergences are a widely studied class of distance measures between probabilities. In this paper, a different class of distance measures on probabilities, called the integral probability metrics (IPMs) is considered. IPMs, for example, the Wasserstein distance and Dudley metric have, thus far, only been used in a limited setting, as theoretical tools in mass transportation problems, in metriz...
Journal: CoRR
Volume: abs/1607.02665
Pages: -
Publication date: 2016